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Empirical studies have revealed remarkable perceptual organization in neonates. Newborn 
behavioral distinctions have often been interpreted as implying functionally specific 
modular adaptations, and are widely cited as evidence supporting the nativist agenda. 
In this theoretical paper, we approach newborn perception and attention from an 
embodied, developmental perspective. At the mechanistic level, we argue that a 
generative mechanism based on mutual gain control between bilaterally corresponding 
points may underly a number of functionally defined "innate predispositions" related to 
spatial-configural perception. At the computational level, bilateral gain control implements 
beamforming, which enables spatial-configural tuning at the front end sampling stage. 
At the psychophysical level, we predict that selective attention in newborns will favor 
contrast energy which projects to bilaterally corresponding points on the neonate subject's 
sensor array. The current work extends and generalizes previous work to formalize the 
bilateral correlation model of newborn attention at a high level, and demonstrate in 
minimal agent-based simulations how bilateral gain control can enable a simple, robust 
and "social" attentional bias. 
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1. INTRODUCTION 



The empirical position is, to be sure, in agreement with the 
nativistic on a number of points — for example, that local signs 
of adjacent places on the retina are more similar than those far- 
ther apart and that the corresponding points on the two retina are 
more similar than those that do not correspond. Helmholtz (1968) 



Psychophysical studies have demonstrated quite sophisticated 
spatial-configural perception in newborn babies. See for exam- 
ple Johnson et al. (1991); Slater and Kirby (1998); Farroni 
et al. (2000); and Streri et al. (2013a). Such findings have often 
been interpreted in terms of functionally defined "innate pre- 
dispositions" (Morton and Johnson, 1991; Spelke, 1998; Streri 
et al, 2013b), which are seen as providing the "biological basis" 
of perceptual and social development (Johnson and Morton, 
1991; Johnson, 2003). Other researchers have suggested that con- 
specifics are identified as "like me" via perceptuomotor resonance 
with internal representations of the self, developed prenatally 
through self-exploration and proprioception (Sugita, 2009; Pitti 
et al, 2013). Some have even proposed that an innate precursor of 
the mirror neuron system underlies newborn sociality (Meltzoff 
and Decety, 2003; Lepage and Theoret, 2007). Here we extend and 
generalize previous work on newborn face preference (Wilkinson 
et al., in press), to outline a high level formalization of a novel 
perspective in which embodied sampling biases provide "innate 
information" about space, configuration, and "like-me-ness." 



Specifically, the sampling bias under focus in the current paper 
is bilateral sensor distribution and integration. Many authors 
have noted the important role of multimodal correlation in 
social interaction and perceptual learning (e.g., Bahrick et al., 
2004; Sai, 2005). Less attention has been paid to the implications 
of intramodal bilateral correlations from the social and behav- 
ioral perspective, though bilateral sensory interaction has been 
intensively studied in its own right (see "Bilateral mutual gain 
control" below). Here we hypothesize a functional relationship 
between bilateral interaction and spatial-configural perception 
in newborns. A simplified formal model explains how bilat- 
eral mutual gain control can enable spatial-configural perceptual 
distinctions analogous to some of those observed in human 
newborns. Agent-based simulations address minimal analogs of 
visual perception of "size constancy" (Granrud, 1987; Slater et al., 
1990), visual and audio-visual face perception (Goren et al., 
1975; Johnson et al., 1991; Sai, 2005), and the dependency of 
face perception and social learning on infant directed align- 
ment, or "direct gaze" (Farroni et al, 2002; Guellai and Streri, 
2011). 

In section 2, we present a case for bilateral mutual gain con- 
trol as a general aspect of intersensory integration, based on the 
existing literature, and recruit the array signal processing for- 
malism of beamforming (Naidu, 2001) as a way of describing 
spatial-configural attention and orienting. In section 3, we define 
a minimal formal model to describe how bilateral gain control 
can generate an overt attentional preference for the "like me." In 
section 4, we report results of simulations based on this model. 
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Finally, we discuss the relevance and scope of our theoretical 
findings, and offer some concluding remarks. 

2. THE NEWBORN AS A MULTIMODAL, BILATERAL SENSOR 
ARRAY 

Gibson (1966) argued that the sensing body should be charac- 
terized as a multimodal sensor array. In any sensing process, the 
front line is the sampling regime adopted; neural/ computational 
processing can only process that which has been sampled. It is 
therefore impossible to meaningfully characterize sensory input 
to the nervous system during active behavior without consider- 
ing the physical embodiment of the sensor array (Towal et al., 
2011). The vertebrate sensor array is structured in a more or 
less bilaterally symmetric manner. A growing consensus points to 
mutual gain control as a fundamental feature of bilateral interac- 
tions (e.g., Li and Ebner, 2006; Ding et al, 2013; Schmidt, 2013; 
Wunderle et al, 2013; Xiong et al., 2013). Mutual gain control 
is one mode of enacting beamforming (e.g., Westermann, 2003; 
Ma, 2013), a standard technique for selective tuning in sensor 
array technology. In this section, we unpack this analogy between 
selective attention in newborn infants and selective tuning in sen- 
sor array theory. First, we review the neurobiological literature 
on bilateral mutual gain control and the relationship of gain con- 
trol to attention. This sketches a plausible neural substrate for the 
current model, which we term the "bilateral correlation model" 
("BCM") of newborn attention. We then briefly introduce beam- 
forming and explain how it is related to bilateral gain control and 
attention. 

2.1. BILATERAL MUTUAL GAIN CONTROL AND SENSORY 
ATTENTI0NAL GATING 

2. 1. 1. Bilateral symmetry structures vertebrate physiology 

The bilateral structure of the brain and body is aligned and inte- 
grated according to symmetric correspondence at many stages 
of sensory and motor processing, an architecture perhaps most 
clearly demonstrated by the corpus callosum (Iwamura, 2000; Li 
and Ebner, 2006), which links corresponding bilateral points in 
the brain. Tactile stimulation of one hand causes bilateral corti- 
cal activation at corresponding somatotopic points (Hansson and 
Brismar, 1999). Binocular cross-correlation is widely thought to 
underly stereo vision (Banks, 2007; Filippini and Banks, 2009). 
Binaural cross-correlation informs orienting and looking behav- 
ior in neonates (Mendelson et al., 1976; Jiang and Tierney, 1996; 
Furst et al., 2004), and the foetus is capable of auditory orienting 
in utero (Voegtline et al, 2013). From an aesthetic perspective, 
the "sweet spot" region of binaural synchrony is manipulated 
by sound engineers to deliver the most enjoyable and engag- 
ing listening experience (Theile, 2000; Bauck, 2003), suggestive 
of a more general multimodal link between bilateral correlation, 
arousal and "liking." 

2. 1.2. Neural gain control implements selective attention 

Neural gain control can implement multiplicative interactions in 
vivo (Rothman et al, 2009). Gain control acts like an ampli- 
fier, or "gate" for input signals. Gain control is widely thought 
to mediate selective attention (Hillyard et al., 1998; Salinas and 
Sejnowski, 2001; Aston-Jones and Cohen, 2005; Reynolds and 



Heeger, 2009; Feldman and Friston, 2010; Katzner et al., 2011; 
Sara and Bouret, 2012), and has been mechanistically linked to 
ascending projections from neuromodulatory hubs and the sym- 
pathetic nervous system (Aston-Jones and Cohen, 2005; Voisin 
et al, 2005; Sara, 2009; Fuller et al., 2011). Presynaptic sychrony 
can also modulate postsynaptic gain in a feedforward fashion 
(Huguenard and McCormick, 2007; Womelsdorf and Fries, 2007; 
Gotts et al., 2012). Gain control has been formally equated with 
the modulation of Bayesian precision in probabilistic generative 
modeling (Feldman and Friston, 2010; Moran et al., 2013). 

2. 1.3. Bilateral mutual gain control 

There is extensive evidence that mutual gain control is an impor- 
tant aspect of binocular (Ding and Sperling, 2006; Meese and 
Baker, 2011; Ding et al., 2013), binaural (Kashino and Nishida, 
1998; Ingham and McAlpine, 2005; Steinberg et al, 2013; Xiong 
et al., 2013), and bitactile (Hansson and Brismar, 1999; Li and 
Ebner, 2006) interactions in various species. Interhemispheric 
interactions via corpus callosum are well described by a gain con- 
trol relation (Li and Ebner, 2006; Schmidt, 2013; Wunderle et al., 
2013). Thus mutual gain control is the most plausible general 
framework for bilateral sensory interaction, though many par- 
ticulars are likely to exist at a more detailed level. Given this 
organization, a behavioral preference for stimuli which induce 
strong correlations between corresponding points is practically 
inevitable. 

2.1.4. Bilateral gain control in the neonate 

Studies examining prestereoptic binocular vision in human 
infants have produced conflicting results, and have not included 
neonatal subjects (Shimojo et al, 1986; Brown and Miracle, 2003; 
Kavsek et al., 2013). The maturity of binocular gain control cir- 
cuitry in human newborns is therefore unknown. In the rhesus 
macaque, considered a good model for the human visual sys- 
tem, binocular circuitry is quite mature in neonates (Rakic, 1976; 
Horton and Hocking, 1996), and responses are limited by low 
monocular sensitivity rather than binocular immaturity (Chino 
et al, 1997). Binaural integration is also functioning, if not 
entirely mature, in neonates (Furst et al, 2004; Litovsky, 2012). 
The BCM assumes that newborn bilateral integration is a qualita- 
tively similar, if immature, version of that in adults, but does not 
prescribe the precise transform; many variations on mutual gain 
control are possible. In adults, bilateral interactions are modu- 
lated by mono and stereo normalization (Carandini et al., 1997; 
Moradi and Heeger, 2009; Carandini and Heeger, 2011), but the 
nature of normalization in the newborn human is, as far as we 
know, unknown. 

2.2. THE THALAMUS— AN ARCHITECTURAL HUB FOR ATTENTI0NAL 
GATING AND BILATERAL ALIGNMENT 

2.2. 1. Bilateral alignment in the thalamus 

The thalamus embodies architectural alignment of the signals 
from bilaterally corresponding sensors (Jones, 1998). For exam- 
ple, the lateral geniculate nucleus (henceforth "LGN"), is com- 
prised of six layers. Connections from contralateral nasal retina 
project to layers 1, 4, and 6, while ipsilateral temporal retina 
projects to layers 2, 3, and 5. All of these layers are precisely 
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in-reference with respect to retinal receptive fields. This archi- 
tecture forms a structural basis for precise binocular receptive 
fields in efferents such as primary visual cortex. Bilateral audi- 
tory (Wrege and Starr, 1981; Fitzpatrick et al., 1997; Ingham and 
McAlpine, 2005; Krumbholz et al, 2005) and tactile (Mountcastle 
and Henneman, 1949; Emmers, 1965; Davis et al., 1998; Coghill 
et al., 1999) pathways also converge in the thalamus and earlier. 

2.2.2. Sensory gain control and attention in thalamus 

An active role for the thalamus in attention has long been theo- 
rized (Clark, 1932; Crick, 1984), and evidence supporting these 
suggestions is accumulating (Varela and Singer, 1987; Sillito et al., 
1994; Sherman and Guillery, 2006; Sherman, 2007; Saalmann 
and Kastner, 2011; Ward, 2013). The thalamus is strongly asso- 
ciated with gain control (Saalmann and Kastner, 2009). LGN 
receives only about 10% of its input from retina, with the other 
90% constituted in approximately equal proportions by cholin- 
ergic projections from the parabrachial nucleus of the brainstem, 
inhibitory control from the thalamic reticular nucleus, and feed- 
back connectivity from layer 6 in striate cortex (Saalmann and 
Kastner, 2009). Recently, layer 6 in visual cortex has been shown 
to mediate gain control of superficial layers (Olsen et al., 2012; 
Velez-Fort and Margrie, 2012). This extensive modulatory net- 
work effects gain modulation in LGN, thereby gating visual input 
to the cortex (Sherman, 2007; Saalmann and Kastner, 2011; Lien 
and Scanziani, 2013). Corticofugal feedback also modulates gain 
control in auditory thalamus (Grothe, 2003; He, 2003; Zhang 
etal., 2008). 

2.3. BEAMFORMING. ORIENTING. AND MOTOR ATTENTION 

Beamforming, a form of spatial filtering, is a technique for 
manipulating the spatial tuning of a sensor array (Naidu, 2001). 
Applications include astronomy (e.g., the SCUBA-2 project '), 
neuroimaging (Van Veen and Buckley, 1988; Siegel et al., 2008), 
"smart" audio technology (Kellermann et al., 2012; Sun et al., 
2012), and network communication (Litva and Lo, 1996; Lakkis, 
2012). The mathematical essence of beamforming is maximiza- 
tion of constructive interference between the signals from an 
array of sensors. Integrating the signals from an array of spa- 
tially distributed sensors creates a "beam," or preferred angle 
of arrival, for incoming signals. Mutual gain control is one 
possible integration function e.g., Ma (2013). When the sig- 
nals from all the sensors are temporally aligned, constructive 
interference is maximized and the input signal is faithfully repro- 
duced. Adding differential delays to the sensor inputs, or phys- 
ically turning the array, can rotate this beam in space, so that 
sources at particular locations (e.g., a mobile phone) can be 
targeted, whilst noise from elsewhere is tuned out; a kind of 
technological "selective attention." See Figure 1 for a schematic 
depiction. 

For example, in delay-and-sum beamforming, a large set of 
delays is provisionally added to the signals from the array. The 
particular delay(s) which maximizes the constructive interference 
with a reference sensor (i.e., maximizes the combined signal from 



^ttpV/www.jach. hawaii.edu/JCMT/continuum/scuba2. html 



Beam 
Pattern 




FIGURE 1 | A schematic depiction of beamforming. The angle of arrival 
to which the array is tuned is represented by the "beam" projected onto 
space. This may be steered by adding delays at the integration step, or by 
physically turning the array. 



the whole array) is identified. This optimum delay is proportional 
to the angle of arrival of the signal and may be calculated as; 



(i — l)dcos6 
A(tf) = ,i 



1,2. ..« 



(1) 



Where d is inter-sensory distance, c is the traveling speed of the 
signal (e.g., the speed of sound), 9 is the angle of arrival of the sig- 
nal, and i represents the position of the sensor concerned relative 
to the reference sensor. More than two sensors can of course be 
used. In practive, 6 is usually not known. 

2.3.1. Audio Beamforming 

Beamforming is mathematically similar to the widely accepted 
notion of sound localization based on inter-aural time difference 
("LTD") (Jeffress, 1948; Fitzpatrick et al, 2000, 2002; Joris and 
Yin, 2007; Hartmann et al, 2013). Binaural hearing aids employ 
beamforming to give directional selectivity, allowing the device 
to focus on sound sources at the auditory midline (Greenberg 
and Zurek, 1992; Kompis and Dillier, 2001; Westermann, 2003; 
Ma, 2013). A recent study found that auditory adaptive coding 
mechanisms primarily target sources near the interaural midline 
(Maier et al., 2012), suggestive of a "special" attentional status for 
midline sources. Audio beamforming is fundamental to techno- 
logical approaches to the "Cocktail Party Problem" (Cherry, 1953; 
Haykin and Chen, 2005). 



Frontiers in Neurorobotics 



www.frontiersin.org 



February 2014 | Volume 8 | Article 9 | 3 



Wilkinson and Metta 



Bilateral gain control in neonates 



2.3.2. Motor attention and beamforming 

Adding the delay A(f;) steers the angle of arrival to which the 
array is tuned, and is equivalent to physically turning the array. 
Physically turning the array is analogous to the psychological 
concept of orienting or overt attention. Adding delays to "virtu- 
ally" orient the array is analogous to "covert attention." Covert 
attention and overt attention are thought to be tightly linked 
(Eimer et al., 2005; de Haan et al., 2008), though appear to be 
mediated by different cellular networks (Gregoriou et al., 2012). 
Physically turning the array provides the basis for the motor 
side of selective attention in the current model. A movement 
equivalent to the delay is enacted in order to physically align 
the array with the source. We deal only with the overt orient- 
ing case here. Thus A(f,) corresponds to a motor command, 
rather than an internal delay. Here we reference this difference 
by the term "active beamforming." This may seem a little topsy- 
turvy, as beamforming was to a large extent designed to avoid 
the need for slow and energy hungry mechanical orienting of 
the array, but we believe it conveniently expresses the underly- 
ing continuity between mechanical and "virtual" manipulation of 
alignment. 

2.3.3. Visual beamforming 

The speed of light renders delays virtually undetectable. Light's 
beamlike propagation and the design of visual optics provide 
direction selectivity a priori. Thus visual beamforming is not stan- 
dard practice, and the terminology adopted here may thus be 
non-standard. However, the mathematical concept at the heart 
of beamforming — maximizing constructive interference between 
sensors — applies equally to the visual case in the current model. 
As the underlying sensory computations are the same between 
vision and audition, we adopt the continuous terminology "visual 
beamforming." 

The beamforming technique provides an array tuned to two 
different flavors of stimulus. Firstly, one source that lies on the 
intersection of the sensors' lines of sight (as just described for 



Photoreceptor Sampling = 2 x Spatial Frequency 




0 

nearly 1 00° o transmitted 



FIGURE 2 1 The loss of the sinusoidal signal when signal 
frequency equals sensor distribution frequency constitutes 
morphological resonance. Constructive interference, and hence 



audio beamforming). Secondly, multiple sources, one on each line 
of sight. This yields tuning to particular configurations of multi- 
ple signal sources, simply through spatial resonance between the 
configuration of transmitters and the configuration of receivers. 
In the case of sensors with parallel directional tuning, this "pre- 
ferred" configuration is the same configuration as that of the 
sensor array itself. Thus it functions as a "like me" configuration 
detector. This is the special case of Nyquist-Shannon sampling 
theorem, wherein signal frequency equals sensor distribution fre- 
quency, causing maximal constructive interference between the 
combined signals. See Figure 2. Prior to the development of 
convergent stereopsis, newborn binocular alignment is (approx- 
imately and noisily) parallel (Thorn et al., 1994), though this 
depends on what the infant is looking at (Slater and Findlay, 
1975). 

Multiple source signals often occur in the context of prob- 
lems in sensor array theory, indeed it is this sensitivity that is 
targeted by signal jamming techniques (Poisel, 2011). However, 
multiple transmitter, multiple receiver set-ups are widely used in 
wireless communication (termed "MIMO" multiple-input and 
multiple-output beamforming e.g., Raleigh and Cioffi, 1998) 
and may be highly relevant in the case of biological vision. 
Eyes are a particularly important example of a multiple visual 
signal source (Gliga and Csibra, 2007). Single transmitter and 
multiple transmitter beamforming specifies two ways in which 
a scene can be "like me" to a newborn infant; it can "fit" 
my sensor array by containing single sources which lie on the 
intersections of my bilateral lines of sight, and it can "fit" 
my sensor array by containing multiple sources which inde- 
pendently occupy both the lines of sight of a bilateral sen- 
sor pair. To our knowledge, this functional relationship between 
"visual beamforming," spatial-configural vision and "like-me" 
perception in the newborn is a novel proposal. However, the 
underlying mechanisms proposed are not; they are just those 
of standard binocular circuitry and established communications 
technology. 



Photoreceptor Sampling = Spatial Frequency 




nothing transmitted 



combined output of the array, is maximized in this case. Image 
slides courtesy of Austin Roorda, UC Berkeley http://roorda. vision. 
berkeley.edu. 
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3. A FORMAL MODEL 

In this section, we define a simplified model world in which to 
formalize our arguments and demonstrate the resultant effects in 
minimal computational simulations. 

3.1. A SIMPLE WORLD 

Let us first define a discrete World W with two spatial dimen- 
sions P, 8, and time T. P and © refer to the location (in polar 
coordinates) in space of arbitrary quanta each with one binary 
degree of freedom C. See Figure 3. We use lowercase letters to 
denote specific values of capitalized variables, and to denote spe- 
cific members of capitalized sets. Specifically, W is the set of all 
individual discrete World points w, each of which is characterized 
by four state variables p, 6, x, c. 

Vw{w 6 W} (2) 

W =< p, 6, X, c > 

C corresponds to the presence ("1") or absence ("0") of a visual 
signal source. Within this World, a subset of points w have c = 1 
(indicating a visual signal source). We denote this set Wc C W; 

Ww{w e W c : w c = 1} (3) 

3.2. BUILDING AN AGENT 

3.2. 1. A single visual sensor 

Let us introduce a binary visual sensor, which occupies one 
space/quanta w in the World W. It has a tight beam, perfectly 



straight line of sight. We treat the travel time of light as zero. 
It is convenient to place the sensor at the origin, such that for 
any given location in the World w that location is on the line of 
sight if and only if w% = Sensore. The line of sight is then a set L 
containing all locations w in the World satisfying 

Vw|w e L : Q w = 6 sensol -}. (4) 

The sensor output SEye is 1 just in case its line of sight contains at 
least one point w where w c = 1, that is if L intersects Wc, other- 
wise it outputs zero. This may be viewed conveniently as a logical 
disjunction (OR gate) with the entire line of sight as input, which 
we will denote S = vL. The presence of a sensor at a point w sets 
w c = 1, i.e., the sensor is also a visual signal source. 

3.2.2. A binocular agent 

Let us place two visual sensors in the World, oriented in parallel 
and separated by an inter-sensory distance <t>. We refer to this sen- 
sor pair as "eyes" for readability. They are connected via a logical 
AND gate. The output of this AND operation is denoted SEyes and 
may be described in full as 

SEyes = &(VlEyel, VL£ye2)- (5) 

This is a minimal analogy of mutual gain control, the AND oper- 
ation being equivalent to multiplication in the binary case. It is 
convenient to view SE yes as a single "meta-sensor" with a double 
aperture and a "forked" line of sight. Let us denote this forked LoS 
^Eyes, of which L£ ye i and LE ye 2 are subsets. SEy es = 1 if and only if 
both LEyei and LEye2 intersect Wc- 

3.2.3. Vergence 

Vergence eye movement changes the distance separating the visual 
axes (i.e., the LoS of the eyes), here denoted <t>, over depth, 
according to; 

O p = cj> - ptanG (6) 

where p is depth, and 6 is vergence angle in radians. Positive 6 cor- 
responds to convergence and negative to divergence. <t> and <J> are 
measured in the same units as p. With no vergence/parallel axes 
(i.e., 6 = 0), this reduces to $ = <J> at any depth. In the current 
model the sensors are assumed to be aligned in parallel, to keep 
things simple. 

3.2.4. A dyad of binocular agents 

In order to show how this set-up enables "like me" detection, let 
us introduce a second agent into the World (as in Figure 3). At 
this stage, it is useful to adopt the term "transceiver," which ref- 
erences a mechanism which is both a transmitter and a receiver. 
This terminology becomes useful in the dyadic case, as one inter- 
actor's sensor is the other's signal. If we place the two agents in 
alignment, their pairs of visual transceivers mutually satisfy one 
another's sensory condition SEyes = 1 (Figure 3). However, if we 
translate either agent relative to the other, they no longer stimu- 
late one another in this way. In order to see the other agent when 
it is not directly in front, a wider visual field of view is required. 



90 60 




270 

FIGURE 3 | A logpolar World, containing two "agents," one located at 
the origin, the other at distance of p = SO, and aligned with ("looking 
at") the other agent at the origin. The red dots correspond to the 
positions of the sensors of each agent. The semi-transparent red lines 
correspond to the LoS of the central visual sensors for each "eye." The 
semi-transparent blue line demarks the auditory midline, or line of 
equidistance from the two auditory sensors. This corresponds to the 
"beam" projected in beamforming. The blue circles represent the "mouth" 
of the agents. 
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FIGURE 4 | The spatiotemporal line of sight of two audio sensors 


in one spatial dimension plus time. The intersections 


are 


highlighted by magenta rings. With no delay, the intersection of the 


lines of sight is the point equidistant from the two sensors. By 


adding 


temporal delays to one sensor, the intersection can be 


translated in space. 







3.2.5. Wide field of vision 

Let us define an agent with numerous copies of the sensor 
arrangement just described, analogous to a "pin-hole camera." 
Each agent then has two wide angle sensors ("eyes"), while each 
"eye" consists of many individual light sensors ("rods") with tight 
beam lines of sight; a nested sensor array. Formally, this may be 
represented simply by expressing the above description in terms 
of pointwise vector or matrix operations. The calculations are just 
the same, but repeated in parallel across the visual field. With this 
wider field of vision ("FoV"), an agent may see things not only 
directly in front, but also off to the side. The spatial offset from 
center may then be used to guide orienting. Thus we define each 
eye as a vector of individual sensors; 

^Eyel = [SEyel,l> SEyel,2> SEyel,i> ■ •••SEye2,«] (7) 
^Eye2 = ["5Eye2.li ^Eye2,2> 5Eye2,z---^Eye2,n] 

and corresponding points (<£►) according to; 

SEyel.l O SEye2,l> (8) 
^Eyel.z ^ ^Eye2,i • • • 

For each corresponding bilateral sensor pair, there is an integra- 
tion node SEyes,«> yielding a vector of integration nodes SEyes,i...i- 
If any of these are 1, that is if the agent encounters binocular cor- 
relations in its field of view, then it goes in to "align" mode (see 
"Movement" below). The extent of the FoV for a visual sensor 
was arbitrarily set to be 0.1 it, but other values could be used. 
Sampling resolution a within this FoV was set to 0.0005 radians, 
but other values could be used. 

However, there is connection to audition via S only for 
the "foveal" sensor pair at the center of the field of vision 
SEyei.Center O S E ye2,Center (see "Inter-modal integration" below). 
Peripheral occurrences of SE yeSi> = 1 are used only for orienting, 
which brings the signal to align with S E y e i >C entei- S Ey e2,Center- 

3.2.6. Audio sensing and signalling 

Let us add audio capacity to the agents. The "mouth" is located 
midway between the "eyes" of the agent, and generates an audio 
event at each time step. The omnidirectional "ears" are co-located 
with the "eyes." However, they have a different LoS. The momen- 
tary LoS LEar of an ear may be defined as a subset of W wherein 
the spatial distance from source to sensor equals the temporal 
distance from source to sensor. This takes the form of a cone 
in spacetime. See Figure 4. Assuming the sensor is at the origin 
and sound travels at one spatial quanta per temporal quanta, this 
is just; 

lEar C W (9) 

Vw{w e L Ea r : x w = -p w }. 

The aural LoS are integrated in just the same way as the 
visual LoS; 

Sfiars = &(vi Ea rl, VlEari). (10) 



Ignoring time, the intersection LE ar i H lEye2 forms a straight line 
in space defined by equidistance from both sensors. See Figure 3. 
This is equivalent to the "beam" of directional tuning projected in 
beamforming (Figure 1). To avoid the inefficiency of simulating 
sound propagation, the code used for the current model imple- 
ments audio beamforming by hand. For each sound source, the 
code directly measures its offset from the auditory midline LEari l~) 
iEye2- This value can then be used to generate a motor command 
for auditory orienting as in Equation (12). Naturally, the real 
world case with reverberations etc. becomes much more complex, 
but the simplified implementation is sufficient for demonstrative 
purposes. 

3.2.7. Intermodal integration 

We now introduce an intermodal meta-sensor S, which com- 
bines the signals from the two unimodal meta-sensors i>Ears and 
SEyes,Center> again with an AND gate. This operation may be 
described in full as; 

S = &(&(VL Ea rl, VL E ar2), &(Vl E yel , Center » v iEye2,Center))-(H) 

S = 1 if and only if all the lines of sight of the agent's indi- 
vidual sensors intersect Wc (or equivalent in the appropri- 
ate modality e.g., Wa for audio events). Whether they inter- 
sect at the same point in Wc/Wa or different points corre- 
sponds to the difference between location tuning and config- 
uration tuning mentioned in "Visual beamforming" above. In 
terms of the momentary state of the agent's sensory appara- 
tus, however, these different ecological situations are indistin- 
guishable. The AND integration steps cannot decrease sparsity. 
Sparsity here corresponds directly to "selectiveness" in selec- 
tive attention; with respect to sparsity/selectivity, S > SsensorPair > 

^Sensor • 

Thus we arrive at an increasingly exclusive, hierarchical sen- 
sory definition of "like-me-ness," in a formal sense amenable to 
implementation on a robot. The higher a momentary sensory 
sample L reaches in the bilateral, multimodal hierarchy, topped 
here by S, the more attention it receives. 
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3.2.8. Movement 

We place one agent, representing the "subject" at the origin. This 
agent can only rotate on the spot. The other agent, representing 
the "stimulus," is positioned at a depth of 50 quanta and a ran- 
domized position in 0. It does not move. The stimulus is always 
oriented toward the subject at the origin, except where "averted 
gaze" is the experimental condition. See Figure 3. The default 
"searching" behavior for the subject is to rotate in a circle. If in 
the current timestep, SEyes,; = 1» anywhere in the visual field, the 
agent changes to an "align" behavior. When the subject is in align 
mode, it orients so as to center this stimulation by producing a 
motor command proportional to the offset of the correlated sen- 
sor pair from the center of the visual field. In the simulations 
where the ears and mouth are used, the subject also rotates to 
minimize the delay 8 Audio between the audio streams. 

A0 = r|(z' center - /correlated) + ^Audio (12) 

where r| is a motor gain constant for which we used r| = — 0.15cr 
(a denotes sampling resolution). Together with the sensory 
bias for the center mentioned in the previous subsection, this 
"orienting-to-center" effectively implements an a priori motor 
assumption of special status for the midline. This "center is spe- 
cial" assumption probably deserves closer scrutiny in terms of 
its biological basis and potential justification in terms of opti- 
mal sampling strategies, but this topic is beyond the scope of the 
current paper. 

4. AGENT-BASED SIMULATIONS 

In this section, we describe a series of simulations based in the 
simple world just described. The simulations address minimal 
analogs of three newborn abilities; perception of size constancy 
(Granrud, 1987; Slater et al., 1990), visual face perception (Goren 
et al., 1975; Johnson et al, 1991) and audio-visual alignment 
(Guellai and Streri, 2011). In each of the simulations here, the 
"task" for the subject is to find and align with the stimulus, 
despite increasing levels of distractors which we place into the 
world. Distractors are randomly dispersed visual events which 
are permanent and immobile throughout the course of a trial. 
Where audio distractors are also used, each visual distractor may, 
with a 25% chance at each timestep, also generate an audio 
event. As a quantitative measure of performance, we recorded 
the angular distance in 0 of the subject from alignment with the 
stimulus at the end of the trial, and term this value the "error." 
Each simulation was run 100 times and results averaged over all 
runs. 

4.1. SIMULATION 1.0— SIZE CONSTANCY 

Granrud (1987) and Slater et al. (1990) found that new- 
borns could perceive "size constancy" across changes in depth. 
Familiarization studies showed that the subjects could perceive a 
distinction between a patterned object and a double-sized (other- 
wise identical) object at double the distance, such that the size and 
form of the retinal projection were identical. The size constancy 
effect has been cited as some of the most convincing evidence 
for innate perceptual predispositions (Spelke, 1998; Streri et al., 
2013b). 



4.1.1. Results 

This simulation did not include audio capacity or events. A stimu- 
lus of equal size to the subject was compared to a stimulus of twice 
the size. For the "same size" condition, the depth p of the stimulus 
was increased in increments of 1 quanta from 50 to 100 quanta. 
In the "double size" condition, depth increased in increments of 2 
from 100 to 200 quanta; thus double the associated distance for 
the "same size" stimulus. Figure 5 displays mean error against 
distance for the two stimulus conditions. The distance of the 
stimulus from the subject makes little difference to the result; it 
is physical size of the configuration which is targeted by visual 
beamforming, and translation in depth makes no difference with 
parallel visual axes (until the limits of resolution are reached). See 
Figure 3. The subject finds the equal-size stimulus effectively, and 
ignores the double-sized stimulus equally effectively. 

This is of course a special case; the current model would not 
discriminate a double-sized from a triple-sized stimulus, as it 
would ignore both. More generally, otherwise identical stimuli 
of different physical size may project in an identical manner to 
one retina if depth compensates, but may project differently in 
terms of the spatial relations between the images on both retinae. 
Note that disparity sensitivity is not a prerequisite; interactions 
between corresponding points may in many cases be sufficient. 

4.7.2. Discussion 

Streri et al. (2013b), after Holway and Boring (1941), state that in 
order to recognize size constancy, infants need to combine projec- 
tive size with information about viewing distance. Neither of these 
values are used explicitly in the current model, which nonethe- 
less exhibits a size constancy effect. Perception of size constancy 
implies that the stimuli must be characterized by some difference 
in the way they impact the senses. If there were no detectable dif- 
ference between stimuli at all, the behavioral distinction would 
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FIGURE 5 | Size constancy over depth. The lines represents the error at 
the end of the trial. The red line represents the "same size" condition, and 
the blue line the "double size" condition. For a given depth d of the "same 
size" stimulus, the "double size" stimulus was at twice the distance 2d. 
Regardless of depth, the subject always finds and aligns with the "same 
size" stimulus, and ignores the "double size" stimulus. 
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be not just innate, but supernatural. If there is a sensory dif- 
ference, then a familiarization effect does not imply any specific 
functional adaptation. Binocular correlation patterns could pro- 
vide an informational basis for the discrimation between "small" 
and "far away" in the newborn. Monocular motion parallax might 
provide another potential source of the distinction. 

4.1.3. Prediction 

Size constancy in newborns will be lost under monocular viewing 
conditions. This would distinguish the potential contributions of 
binocular and monocular mechanisms. 

4.2. SIMULATION 2.0— FACE DETECTION 

Goren et al. (1975) and Johnson et al. (1991) found evidence for 
an innate preference for face-like stimuli. We recently showed that 
these results might result from a newborn "preference" for binoc- 
ular correlations (Wilkinson et al., in press). Existing theories of 
newborn face preference posit some kind of monocular neural 
filter applied to the retinal image (Morton and Johnson, 1991; 
Turati, 2004; Pitti et al, 2013). 

4.2. 1. Results — Binocular correlations versus monocular template 
matching 

We compared a subject using bilateral gain control with a subject 
using a "face template" monocularly in each eye. This template 
consists simply in seeing two or more visual signals at the same 
time, analogous to applying a "two blob" template across spatial 
scales. If such a situation occurs, the template subject rotates to 
center the position of the match, averaged between the two eyes. 
There is no audio in this first simulation. Figure 6 depicts average 
angular distance of the subject from the stimulus on the verti- 
cal axis, against increasing levels of distractors on the horizontal 
axis. Without distractors, finding the stimulus is a trivial matter 
of finding the only stimulus around, and so both BCM (red line) 
and template (blue line) methods perform very well. With high 
levels of distractors (Figure 7), the task is really quite difficult. 
As distractors are added and selective attention is required, the 
BCM subject substantially outperforms the template subject. This 
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FIGURE 6 | Face detection error rates (vertical axis) under increasing 
levels of distractors (horizontal axis). 



is because the BCM subject is tuned to a particular physical size 
of stimulus regardless of depth, whereas the monocular template 
subject confounds size with depth, and is therefore more prone 
to distractors. Confounding depth with size is a general feature of 
templates which match the monocular visual array. 

4.2.2. Results — averted gaze 

Studies have shown that good visual alignment of the stimulus 
with the subject ("direct gaze") plays an important role in new- 
born face perception (Farroni et al., 2002; Guellai and Streri, 
2011). To model these findings, we alter the alignment of the 
stimulus, such that it no longer "looks" directly at the subject, 
to provide a minimal analog of what is termed in the literature 
"averted gaze." We examined the effect of averted gaze on face 
detection, using the binocular correlation based detector (black 
line in Figure 6). The result is a roughly constant increase in the 
error rate over all distractor levels. This arises because although 
the subject is often able to find the averted gaze stimulus, it is 
unable to align precisely with it, and so oscillates around the 
stimulus instead of coming to rest at perfect alignment. 

4.2.3. Results — audio-visual face detection 

Adding the audio capabilities described in Audio Sensing and 
Signalling enables improvements in face detection, because S is 
more selective than SEy es (see "Intermodal integration" above). 
Audio distractors were used in this simulation. The stimulus 
was always in the "direct gaze" condition. The magenta line in 
Figure 6 shows the error with audio-visual beamforming. Even 
with high levels of distractors (Figure 7), the stimulus is usually 
located effectively; a configuration of audio and visual distractors 
matching the form of the stimulus is quite rare, even with up to 
20 distractors. 



90 60 




270 

FIGURE 7 | This image shows the world with 20 distractors. The task is 
quite difficult under these conditions, but the BCM does a reasonable job 
even with visual beamforming only. With audio-visual beamforming, 
performance is excellent even with high levels of distractors. 
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FIGURE 8 | The effects of averted gaze on attention and learning. The 

blue line shows mean error (over 100 trials) versus level of misalignment. 
The red line shows the mean amount of time spent usefully in "learning 
mode" (i.e., S = 1 , and the subject has the stimulus in view). These latter 
data have been scaled to fit on the graph. Misalignments up to about 0.07 
radians are mostly undetectable, and the subjects spend about half the trial 
on average (the expected mean amount if the subject usually finds the 
stimulus) in learning mode, with the stimulus in their field of vision. Over 
this level, there is a steep drop off in time spent learning. Over about 0.11 
radians, the subject's ability to locate the stimulus at all becomes 
increasingly poor. 



4.2.4. Discussion 

This minimal model demonstrates "like me" detection through 
audio-visual beamforming, in a manner which is highly depen- 
dent on good mutual alignment. The astute reader may note that 
in the real world, babies have different inter-pupillary distance 
("IPD") to their adult conspecifics. This is indeed an important 
factor. We have previously shown that this need not be a major 
problem in the real world case, as the spatial extension of the 
signal from the eyes can compensate for some difference in IPD 
(Wilkinson et al., in press). In general, allometry and alignment 
are relevant to the behavioral outcomes of bilateral gain control, 
and will generate individual and inter-trial differences. To keep 
things simple, we do not expand on this topic in the current 
article. 

4.2.5. Prediction 

Newborns will prefer stimuli which induce strong binocular cor- 
relations over monocular contrast matched controls which do 
not. A certain level of "face preference" will drop out of this more 
general effect. 

4.3. SIMULATION 3.0 FACE-VOICE INTEGRATION: THE ROLE OF DIRECT 
GAZE 

Sai (2005) elegantly demonstrated that newborns identify their 
mothers face via the association of her voice, which they already 
recognize from the prenatal period. In recent extensions of this 
work, subjects showed long term recognition effects for dynamic 
faces only when accompanied by speech, supporting an impor- 
tant role for multimodal stimulation in triggering/gating learning 
(Guellai et al., 2011). Beyond this, the faces had to fixate the 
infant; averted gaze abolished recognition effects (Guellai and 
Streri, 2011). Here we asked whether audio-visual beamforming 
would produce analogous dependencies on good alignment 
between subject and stimulus. 

In this simulation, the stimulus either aligned properly with 
the subject — the "direct gaze" condition — or is rotated out of 
alignment — the "averted gaze" condition. The amount of rota- 
tion was incrementally increased between 0.0 and 0.2 radians. We 
do not model learning per se in the current contribution, as would 
be required to model recognition effects completely. Instead, we 
constrain the discussion to selective attention, on the basis that 
attention defines what is learned. Learning is assumed to occur if 
and only if S = 1. We then measured the average (over 100 sim- 
ulations) amount of time the subject spends in "learning mode" 
(i.e., S = 1) under each increasing misalignment of the stimulus. 
There were no distractors in this simulation. 

4.3.1. Results 

The results of this simulation are displayed in Figure 8. In the 
"direct gaze" condition (0.0 radians), the subject finds the stimu- 
lus and aligns with it effectively. As a result, it goes into "learning 
mode" (i.e., S = 1) for the remainder of the simulation. Small 
misalignments up to about 0.07 radians are pretty much unde- 
tectable at the resolution used here. Over this level, the amount of 
time spent learning tends quickly to zero. This is because the sub- 
ject finds the stimulus, but then oscillates rapidly around it, trying 
but failing to align with both the audio and visual signals. As 



misalignment increases, the subject ceases to locate the stimulus 
effectively. 

Figure 9 displays results from two individual trials. In the 
direct gaze condition Figure 9A, both the audio and visual 
transceivers of the subject quickly align perfectly with those of the 
stimulus. With gaze averted by 0.1 radians Figure 9B, the subject 
tries unsuccessfully to align with the stimulus, resulting in a fast 
oscillation around the stimulus. The green line represents the dis- 
tance from synchrony of the binaural time signals. Note the fast 
oscillation, generated by the oscillation of the subject around the 
stimulus (Figure 10). Averted gaze means the subject cannot align 
properly with both the visual signal (broken blue line) and the 
audio signal at the same time, and so S = 0 at all times. "Learning 
mode" is obviously an oversimplification, but infants may be suf- 
fering from similar difficulty in aligning their sensor arrays with 
misaligned stimuli. 

4.3.2. Discussion 

These simulations show that bilateral gain control can in prin- 
ciple generate a strong dependence on alignment of multimodal 
signal sources. The beamforming technique projects the tuning 
of the sensor array to a particular subvolume of physical and con- 
figural space. A set of unimodal specifications combine to build 
up a multimodal attentional "template," which is projected onto 
the world as the combined multimodal "beams" of the array. 
This projected template describes an "ideal stimulus" in terms of 
what modalities of signal should be where and in what relations 
in physical space. The establishment and success of an interac- 
tion depends upon the extent and accuracy to which this set of 
selective conditions are met. 

We are implementing a slightly more sophisticated model of 
the BCM on the humanoid robot iCub. The robot's visual system 
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FIGURE 9 | Comparing individual simulations with "direct gaze" (A) 
and "averted gaze" (B) by 0.08 radians. With a direct gaze stimulus, 
the subject (cyan line) finds the stimulus (black line) and quickly aligns 
until error is zero. It then goes into learning mode (red line) for the 
rest of the trial. With the averted gaze stimulus, the subject finds the 



stimulus, but is unable to align precisely with it. This results in the 
oscillatory aligning movements visible in the gaze direction and auditory 
signal (green line). As a result of the failure to align precisely, the 
subject does not go into learning mode. See Figure 10 for a zoomed 
in view of these oscillations. 



FIGURE 10 | Close up of Figure 9 (A) "direct gaze" and (B) "averted 
gaze" by 0.08 radians. In the direct gaze condition, the audio signal (green 
line) aligns perfectly with the midline. The visual signal (blue line) does too. 
Hence "learning mode" (red line) goes on. In the averted gaze condition, 
this alignment fails to converge. The subject continues to oscillate, 
attempting to align with both the visual and auditory signals. The audio 
(green line) and visual (broken blue lines) signals oscillate around good 
alignment without ever both achieving it. Thus the subject never enters 
learning mode. 



consists of monocular spatiotemporal filters in a simple predictive 
coding architecture (Rao and Ballard, 1999), based on adaptive 
pattern generators (Righetti and Ijspeert, 2006). Sensory input 
is subjected to an active cancelation through negative feedback 
of the learned signal. Essentially this is a minimal implementa- 
tion of sensory adaptation; see Lieder et al. (2013) for a more 
realistic example of predictive coding applied to adaptation. The 
remaining sensory input is then subjected to bilateral mutual gain 
control. Audition uses purely temporal filters, but is otherwise 
based on the same mechanisms. It is worth noting that the exist- 
ing literature on active cancelation (Kuo and Morgan, 1999) might 
provide a useful bridge between predictive coding [in particular 
the influential active inference formulation (Friston, 2010)], and 
the radically embodied implications of array sensing. 

Attention is expressed through saccadic eye movements and 
neck movements, so as to centralize the source of the strongest 
region of post filter energy in the visual field, using the 
existing iCub oculomotor system. Vision is blocked during these 
movements. If at any given time, a "strong" (defined by some 



threshold) post-filter visual signal exists in the central region of 
vision, and is accompanied by strong binaural correlation (analo- 
gous to S = 1), then the robot enters an "attentive" state in which 
it records the ensuing audio-visual stream to disk. A value repre- 
senting this state is incremented whenever the above conditions 
are met, and otherwise decays slowly. We will then analyze the 
recorded periods, and assess their selectivity and reliability with 
respect to identifying the "interaction events" we instigate during 
the trial. This is obviously a much simplified model of the new- 
born case, but may still be instructive. The "social" functionality 
is anticipated partly because humans form the bulk of unpre- 
dictably moving objects in the lab (and a newborn babies world), 
and partly because their transceiver array is the right size and 
shape to cause bilateral correlations in multiple modalities (in a 
humanoid robot). We believe that identifying and carefully char- 
acterizing such ecological factors, together with defining minimal, 
biologically plausible mechanisms to exploit them, is crucial both 
to understanding how basic, naive social attention is apparently 
so easy for babies, and to giving robots a similar "social instinct." 

4.3.3. Predictions 

(1) In the psychophysical experiments detailed in Guellai and 
Streri (2011), "averted gaze" interfered with recognition 
effects. The BCM predicts that in a reproduction of this 
experiment, "averted voice" (or binaural misalignment of 
the sound source for the voice) will also impair recognition 
effects. Stereo speakers with delays could be used to manip- 
ulate the apparent source location of the voice in a subtle, 
non-intrusive way. 

(2) We predict that, in the cluttered, busy laboratory environ- 
ment, the robot described above will preferentially record to 
disk during "social interaction" events, in particular the case 
where an interaction partner visually aligns with, and talks 
to, the robot (which people in the lab will be asked to do). 
It will occasionally record other events which just happen 
to impact the senses in the right way to pass the filter. We 
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will also compare audio-visual beamforming against visual 
beamforming alone, and against mono mechanisms alone, 
in order to isolate the contributions of mono spatiotemporal 
filters and predictive coding, multi-modal coincidence, and 
bilateral mutual gain control. 

5. GENERAL DISCUSSION 

Bilaterality provides a spatially selective supramodal dimension 
along which to collapse the spatial and modal extent of the sensor 
array. Bilateral mutual gain control can enable spatial-configural 
tuning through beamforming. The existing evidence that bilateral 
gain control gates sensory flow in adults of various species practi- 
cally implies a downstream effect on the assortment of observable 
outcomes collectively termed "attention" (Li and Ebner, 2006; 
Ding et al, 2013; Wunderle et al, 2013; Xiong et al., 2013). 
The major open questions with respect to "innate" perception, 
attention, and sociality are the nature of bilateral interactions in 
the human neonate and the implications of this physiological 
organization on behavioral and ontogenic timescales. The for- 
mer can be addressed experimentally in neonates, while the latter 
can be explored via longitudinal empirical designs, mathematical 
modeling and developmental robotics (Morse et al, 201 1). 

The formal model and agent based simulations presented here 
are intended in a purely pedagogical sense, to specify and clarify 
our argument and demonstrate its consequences, and provide no 
evidence that newborn perception employs similar mechanisms. 
However, as for any useful theory, the current proposal does link 
back to the biological case via the predictions it makes regarding 
newborn behavior. Given knowledge of a newborn subject's bilat- 
eral sensory perspectives, a BCM based model can make real time, 
individual level salience predictions, with a precision that depends 
on the quality of perspective reproduction and the physiological 
detail of the model. If testing confirms these predictions, then the 
BCM will provide a powerful tool to understand early perceptual 
and social development. 

The fundamental generative prediction of the bilateral corre- 
lation model of newborn attention is that the spatial-configural 
attentional biases of newborn babies will be a function of the 
extent to which multimodal contrast energy in the scene projects 
synchronously to bilaterally corresponding points on the sub- 
jects sensor array, given whatever mono filters are in use. Many 
experimental paradigms are possible, but the classical preference- 
on-average over a population of subjects must be replaced by 
a paradigm focussed at the individual level, as the BCM makes 
predictions which are dependent on the real time allometry and 
alignment of the subject's sensor array. Stereo source manipula- 
tion of the auditory "sweet spot" offers an unintrusive stimulus 
paradigm for assessing the role of binaural correlation in newborn 
attention and learning. Screen based eye-tracking could enable 
the use of dynamic, responsive stimuli which can manipulate 
binocular correlation in real time for individual subjects. 

Our perspective is substantially in agreement with theories 
that innate sociality is based on perceptuomotor resonance and 
motor contagion (Meltzoff and Decety, 2003; Blakemore and 
Frith, 2005; Lepage and Theoret, 2007; Sugita, 2009; Pitti et al., 
2013). However, the BCM differs in taking more seriously the 
role of the body, and resulting sensory and ecological geometry, 



in which the brain and its activity are situated. Innate knowledge 
of the "like me" is embedded in the sampling biases implied by 
sensory morphology and behavioral repertoire, inviting a broader 
conception of the "mirror organism." The shared anatomical 
structure of interacting brains facilitates the interpersonal syn- 
chronization of brain activity (Dumas et al., 2012). Beamforming 
provides a mechanism by which shared corporal embodiment 
can play a similar role in mediating spatiotemporal alignment 
between interactors. It has been suggested that the "social brain 
network" and the "resting state network" may substantially over- 
lap (Schilbach et al., 2008). The current model carries a similar 
message at the corporal level; bilateral sensor distribution and 
integration can create a sampling bias for the "like me" which 
frames neural and bodily development in a deeply and intrinsi- 
cally social context from the earliest stages of sensory sampling 
and the beginnings of experience in the world. We suggest that the 
current proposal is best interpreted as a contribution to the recent 
literature developing an enactive theory of early social develop- 
ment (Varela et al., 1991; O'Regan and Noe, 2001; De Jaegher 
and Di Paolo, 2007; Auvray et al., 2009; De Jaegher et al, 2010; 
Di Paolo et al, 2010; Lenay et al., 201 1; Di Paolo and De Jaegher, 
2012; Froese et al., 2012; Lenay and Stewart, 2012). 

The BCM is relatively easy to implement on a robot, at least 
in the simple "newborn" form presented here. The basic require- 
ments are a bilaterally organized sensor array geometry, and an 
integration step based on mutual gain control. The nature of 
an audio or visual "event" in the current model is deliberately 
abstract. In the real world case, the choice of sensors and the 
mono filters applied will define what qualifies as an event. In the 
case of continuous valued sensor data, the binary AND used for 
intersensory integration may be replaced by multiplication or a 
more complicated gain control function, perhaps including nor- 
malization. The behavioral outcomes which emerge will depend 
on the form of the bilateral sensor pair's LoS and the bilateral 
morphology of the sensor array, shifting much of the explanatory 
burden for observable functional specificity to the embodiment 
of the agent and the ecological context. Therefore the design of 
the robot's sensor array becomes crucial to behavioral outcomes. 
Equally, this implies the natural selection can mould behavioral 
outcomes by moulding the phenotypic instantiation of sensor 
array morphology. Note for example the relatively large head and 
inter-pupillary distance of the human newborn (Pryor, 1969), 
which brings the newborn sensor array closer to good spatial tun- 
ing with that of adult conspecifics. This may be a candidate for 
a specific adaptation for newborn social interaction, though may 
also reflect other necessities such as a large brain. 

The BCM is in agreement with current gain control mod- 
els of bilateral integration (e.g., Ding et al, 2013) and interfaces 
cleanly with existing models attention based on gain modulation 
(Hillyard et al, 1998; Salinas and Sejnowski, 2001; Aston-Jones 
and Cohen, 2005; Reynolds and Heeger, 2009; Feldman and 
Friston, 2010; Sara and Bouret, 2012), spatiotemporal energy 
models of early vision (Mante and Carandini, 2005), and com- 
putational methods based on salience mapping (Itti and Koch, 
2001 ), wherein it can simply provide another contributor to over- 
all salience. Predictive coding (Rao and Ballard, 1999; Bastos 
et al., 2012) and its generalization in the free energy minimization 
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framework (Friston, 2010; Moran et al., 2013) are becoming 
increasingly influential as models of brain function. In this 
approach, a major factor influencing gain modulation is the pre- 
dictability of the signal (Feldman and Friston, 2010). Models 
based on "artificial curiosity" or "intrinsic motivation" (Barto, 
2013; Gottlieb et al, 2013) take a similar approach. Bilateral 
mutual gain control is more a priori, in that modulation is 
[ignoring potential interaction through bilateral normalization 
(Moradi and Heeger, 2009)], independent of the modulated sig- 
nal. Bilateral mutual gain control is also (we found) awkward to 
formalize and justify in terms of predictive coding. 

In terms of the information compression analogy, bilateral 
mutual gain control maps more comfortably to an embodied 
form of lossy transform coding, wherein the basis function in 
which a mono signal is recoded is simply (if unconventionally) 
the signal from its stereo twin, and vice versa. The application 
of lossy transform coding followed by lossless predictive cod- 
ing is standard practice in data compression (Clarke, 1995). The 
role of transform coding is to order information in a format 
where threshold based mechanisms can most effectively elimi- 
nate information according to some a priori value function. For 
example, in MP3, the audio time signal is subjected to Fourier 
transform and components which are undetectable to a compu- 
tational model of human hearing are discarded. The functional 
justification is reduced file size with little appreciable loss of sound 
quality for human listeners. In the BCM, signals which are not 
part of multimodal configurations "resonant" with the allome- 
try and alignment of the array — quantified in terms of bilateral 
correlation — are discarded/damped. The transform consists in 
the large scale neural morpho-architecture collapsing the space 
between bilateral sensor pairs. The sharpness of the filter's cut- 
off may be manipulated by mono resolution and by allowing a 
certain extent of spatiotemporal cross-correlation as well as pure 
correspondence. The functional justification is selective spatial- 
configural tuning to the "like me," in the broadest sense of the 
term. 

The combination of local (mono) decorrelation, followed by 
global (bilateral) correlation, may turn out to be justified on 
sparsity principles (Vinje and Gallant, 2000), and possibly on 
free-energy minimization principles (Feldman and Friston, 2010) 
as priors expressing the expectation that spatially neighboring 
locations in the world are likely to be correlated, whereas spatially 
distributed locations in the world are not. The thermodynamic 
unlikeliness of corporal form ensures that it is rarely approxi- 
mated by random happenings, providing selectivity, while hered- 
ity validates the assumption that resonant signals are likely to be 
interesting, providing relevance. 

The principle of beamforming through mutual gain control 
is not limited to the example of bilateral symmetry expanded 
here. An interesting and testable, if speculative, prediction is that 
a mutual gain control relation will exist between "correspond- 
ing points" on the fingertips of a single hand, across all age 
ranges. Work on this topic could help us to understand naive and 
skilled use of the hand as a sensor array to apprehend the eco- 
logical geometry and dynamics (Gibson, 1961, 1966, 1977), and 
resulting sensorimotor contingencies (O'Regan and Noe, 2001), 
of interacting with different shapes, densities, and textures. 



5.1. CONCLUSION 

We have sketched an uncontraversial argument for bilateral 
mutual gain control as a basic perceptual mechanism. Bilateral 
mutual gain control offers an effective and efficient multimodal 
"like me detector" available to both biological and robotic sys- 
tems. Some form of bilateral interaction in newborns is much 
more likely than no bilateral interaction given current evidence, 
and (perhaps immature) mutual gain control is the strongest can- 
didate for integration. Perhaps more controversially, we argue that 
this may explain a number of "innate predispositions" observed in 
newborn infants. Bilateral interactions have been largely ignored 
as potential causes of innate predispositions in the develop- 
mental psychology literature. Whilst we agree to a large extent 
with the notion of the competent newborn, the current paper 
is aimed to target questionable adaptationist, nativist, and inter- 
nalist assumptions regarding causal structure. Empirical work is 
required to establish the nature of bilateral interaction in the new- 
born human, and its potential relevance to particular abilities 
observed in newborns. Though the current model is a greatly sim- 
plified one, the mechanisms involved are very general. Ongoing 
work is extending the model to dynamic sensor distribution 
through behavior, simple perceptual learning/adaptation through 
predictive coding, interactive scenarios, and real world embod- 
iment on the humanoid robot iCub. Whether bilateral mutual 
gain control can account, fully or partially, for particular exist- 
ing empirical results is a case-by-case question, and we certainly 
do not want to assert that the BCM can fully explain newborn 
social skills. However, we do wish to push the point that ignor- 
ing bilateral integration has led to questionable interpretations of 
newborn abilities. In summary, the effects of bilateral integration 
should be considered and controlled for in the planning, execu- 
tion, and interpretation of psychophysical experiments investigat- 
ing perception, attention, and sociality in newborns. 
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